Search CORE

27 research outputs found

PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison

Author: La Cava William
Moore Jason H.
Olson Randal S.
Orzechowski Patryk
Urbanowicz Ryan J.
Publication date: 01/03/2017

The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists. The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterize the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyze how datasets and algorithms cluster in terms of performance. This work is an important first step towards understanding the limitations of popular benchmarking suites and developing a resource that connects existing benchmarking standards to more diverse and efficient standards in the future. Comment: 14 pages, 5 figures, submitted for review to JML

arXiv.org e-Print Archive

Directory of Open Access Journals

Automating biomedical data science through tree-based pipeline optimization

Author: Andrews Peter C.
Kidd La Creis
Lavender Nicole A.
Moore Jason H.
Olson Randal S.
Urbanowicz Ryan J.
Publication date: 27/01/2016

Over the past decade, data science and machine learning have grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators, such as synthetic feature constructors, that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design. Comment: 16 pages, 5 figures, to appear in EvoBIO 2016 proceeding

arXiv.org e-Print Archive

Scipedia

Predicting the Difficulty of Pure, Strict, Epistatic Models: Metrics for Simulated Model Selection

Author: Fisher Jonathan M
Kiralis Jeff
Moore Jason H
Urbanowicz Ryan J
Publication venue: Dartmouth Digital Commons
Publication date: 01/09/2012

Background: Algorithms designed to detect complex genetic disease associations are initially evaluated using simulated datasets. Typical evaluations vary constraints that influence the correct detection of underlying models (i.e. number of loci, heritability, and minor allele frequency). Such studies neglect to account for model architecture (i.e. the unique specification and arrangement of penetrance values comprising the genetic model), which alone can influence the detectability of a model. In order to design a simulation study which efficiently takes architecture into account, a reliable metric is needed for model selection. Results: We evaluate three metrics as predictors of relative model detection difficulty derived from previous works: (1) Penetrance table variance (PTV), (2) customized odds ratio (COR), and (3) our own Ease of Detection Measure (EDM), calculated from the penetrance values and respective genotype frequencies of each simulated genetic model. We evaluate the reliability of these metrics across three very different data search algorithms, each with the capacity to detect epistatic interactions. We find that a model’s EDM and COR are each stronger predictors of model detection success than heritability. Conclusions: This study formally identifies and evaluates metrics which quantify model detection difficulty. We utilize these metrics to intelligently select models from a population of potential architectures. This allows for an improved simulation study design which accounts for differences in detection difficulty attributed to model architecture. We implement the calculation and utilization of EDM and COR into GAMETES, an algorithm which rapidly and precisely generates pure, strict, n-locus epistatic models.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Dartmouth Digital Commons (Dartmouth College)

Learning Classifier Systems: A Complete Introduction, Review, and Roadmap

Author: Jason H. Moore
Ryan J. Urbanowicz
Publication venue: Hindawi Limited

Crossref

A Classification and Characterization of Two-Locus, Pure, Strict, Epistatic Models for Simulation and Detection

Author: Granizo-Mackenzie Ambrose L. S.
Kiralis Jeff
Moore Jason H
Urbanowicz Ryan J
Publication venue: Dartmouth Digital Commons
Publication date: 01/01/2014

Background: The statistical genetics phenomenon of epistasis is widely acknowledged to confound disease etiology. In order to evaluate strategies for detecting these complex multi-locus disease associations, simulation studies are required. The development of the GAMETES software for the generation of complex genetic models has provided the means to randomly generate an architecturally diverse population of epistatic models that are both pure and strict, i.e. all n loci, but no fewer, are predictive of phenotype. Previous theoretical work characterizing complex genetic models has yet to examine pure, strict epistasis, which should be the most challenging to detect. This study addresses three goals: (1) Classify and characterize pure, strict, two-locus epistatic models, (2) Investigate the effect of model ‘architecture’ on detection difficulty, and (3) Explore how adjusting GAMETES constraints influences diversity in the generated models.

Springer - Publisher Connector

PubMed Central

Dartmouth Digital Commons (Dartmouth College)

A classification and characterization of two-locus, pure, strict, epistatic models for simulation and detection

Author: Ambrose LS Granizo-Mackenzie
B McKinney
C Barber
C Greene
D Shriner
E Brodie III
E Eichler
H Cordell
IB Hallgrímsdóttir
J Cheverud
J Moore
J Rambau
Jason H Moore
Jeff Kiralis
LW Hahn
M Wade
N Beerenwinkel
P Phillips
R Culverhouse
R Fisher
R Neuman
RJ Urbanowicz
Ryan J Urbanowicz
W Bateson
W Frankel
W Kruskal
W Li
Publication venue: Springer Science and Business Media LLC

Crossref

Rule-based machine learning classification and knowledge discovery for complex problems

Author: Ryan J. Urbanowicz
Urbanowicz Ryan
Publication venue: Association for Computing Machinery (ACM)

Crossref